Abstract. The LASSO is a recent technique for variable selection in the regression model

$y = X\beta + \varepsilon$,

where $X \in \mathbb{R}^{n \times p}$ and $\varepsilon$ is a centered Gaussian i.i.d. noise vector $\mathcal{N}(0, \sigma^2 I)$. The LASSO has been
proved to perform exact support recovery for regression vectors when the design matrix satisfies certain
algebraic conditions and $\beta$ is sufficiently sparse. Estimation of the vector $X\beta$ has also been extensively
studied for the purpose of prediction under the same algebraic conditions on $X$ and under sufficient
sparsity of $\beta$. Among many others, the coherence is an index which can be used to study these nice
properties of the LASSO. More precisely, a small coherence implies that most sparse vectors, with fewer
nonzero components than the order $n/\log(p)$, can be recovered with high probability if their nonzero
components are larger than the order $\sigma\sqrt{\log(p)}$. However, many matrices occurring in practice do
not have a small coherence and thus most results which have appeared in the literature cannot be
applied. The goal of this paper is to study a model for which precise results can be obtained. In the
proposed model, the columns of the design matrix are drawn from a Gaussian mixture model and the
coherence condition is imposed on the much smaller matrix whose columns are the mixture's centers,
instead of on $X$ itself. Our main theorem states that $X\beta$ is as well estimated as in the case of small
coherence, up to a correction parametrized by the maximal variance in the mixture model.
1. Introduction

The goal of the present paper is the study of the high dimensional regression problem $y = X\beta + z$,
where $X \in \mathbb{R}^{n \times p}$, with $p \gg n$, and $z \sim \mathcal{N}(0, \sigma^2 I_n)$. For simplicity, we will assume throughout this
paper that the columns of $X$ have unit $\ell_2$-norm. This problem has been the subject of a great deal of
research activity. This high dimensional setting, where more variables are involved than observations, occurs
in many different applications such as image processing and denoising, gene expression analysis, and,
after slight modifications, time series (filtering) [17], [20], machine learning and especially graphical
models [19] and, more recently, biochemistry [1]. One of the most popular approaches is the Least Absolute
Shrinkage and Selection Operator (LASSO) introduced in [23] for the purpose of variable selection.
The LASSO estimator is given as a solution, for $\lambda > 0$, of

(1.1) $\hat\beta = \mathrm{Argmin}_{b \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1$.

Conditions for uniqueness of the minimizer in this last expression are discussed in [16], [21] and [15].
Several other estimators have also been proposed, such as the Dantzig Selector [3] or Message
Passing Algorithms [14]. In the sequel, we will focus on the LASSO due to its wide use in various
applications.
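
To make the objective (1.1) concrete, here is a minimal sketch of a proximal gradient (ISTA) solver. It is not the algorithm of any of the cited references; the names `soft_threshold` and `lasso_ista`, the constant step size and the iteration count are assumptions chosen for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - X b||_2^2 + lam * ||b||_1 by ISTA."""
    # Step size 1/L, where L = ||X||^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                       # gradient of the quadratic term
        b = soft_threshold(b - step * grad, step * lam)  # proximal step for the l1 term
    return b
```

For an orthonormal design the minimizer is simply the soft-thresholding of $X^t y$ at level $\lambda$, which gives a quick sanity check of the sketch.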

One of the most surprising and important discoveries from these recent extensive efforts is that,
under appropriate assumptions on the design matrix $X$, and for most regression vectors $\beta$, the support
of $\beta$ can be recovered exactly when its size is of the order $n/\log(p)$; see [3], [5], [9], [28] for instance.
Moreover, under similar assumptions, the prediction error can be controlled adaptively as a function
of the sparsity of $\beta$ and the noise variance; see for instance [9]. Similar rates can be achieved by
other methods, involving for instance penalization, but the main advantage of the LASSO over most
competitors is that a solution can be obtained in polynomial time, in the sense of complexity
theory. A very efficient algorithm is given, e.g., in [2]. Many implementations are available on the web.

The two main assumptions for achieving these remarkable results are unavoidably imposed on the
design matrix $X$ and on the regression vector.
• The regression vector $\beta$ is assumed to be $s$-sparse, with support denoted by $T$, meaning that
no more than $s$ of its components are nonzero. This can be relaxed to $\beta$ being only
compressible, that is, approximable by a sparse vector.

• The design matrix is assumed to satisfy one of the many algebraic conditions proposed in the
literature, implying that all singular values of $X_S$ are close to one for any or most given
$S \subset \{1, \ldots, n\}$ with $|S| = s'$ for some appropriate choice of $s'$ (often equal to $s$ or $2s$).

Concerning the second point, two main assumptions have been proposed in the literature. The first
is the Restricted Isometry Property [10], [8], which requires that

(1.2) $(1-\delta)\|\beta_S\|_2 \le \|X_S\beta_S\|_2 \le (1+\delta)\|\beta_S\|_2$,

for $S \subset \{1, \ldots, n\}$ with $|S| = s'$ and all $\beta \in \mathbb{R}^p$. This property is satisfied with high probability for most
random matrices with i.i.d. entries with variance $1/n$, such as Gaussian or Rademacher variables, and
for $s' \le C_{rip}\, n/\log(p)$, where the constant $C_{rip}$ depends on the distribution of the individual entries.
Notice that the $1/n$ assumption on the variance and standard concentration bounds imply that the
resulting random matrix has almost normalized columns, and its normalized version will satisfy the RIP
with inessential modifications of the constants. The RIP has been extensively used in signal processing
after the emergence of the so-called Compressed Sensing paradigm [7].

The second assumption which is often considered is the Incoherence Condition, which requires that

$\mu(X) = \max_{j \neq j'} |\langle X_j, X_{j'} \rangle|$

is small, e.g. $\mu(X) \le C_\mu/\log(p)$ as in [9], which is guaranteed for random matrices with i.i.d. Gaussian
entries with variance $1/n$ in the range $n \ge C_{ic} \log(p)^3$.
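
The coherence is straightforward to evaluate numerically from the Gram matrix. The following is a minimal sketch; the function name `coherence` is an assumption, and the columns are renormalized inside the function so that it can also be applied to matrices whose columns do not already have unit norm.

```python
import numpy as np

def coherence(X):
    """Largest absolute inner product between two distinct normalized columns of X."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit l2-norm columns
    G = np.abs(Xn.T @ Xn)                              # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore the pairs j = j'
    return G.max()
```

An orthonormal family has coherence 0, while a design containing two proportional columns has coherence 1, which is the situation the present paper is concerned with.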

The main advantage of the Incoherence Condition over the Restricted Isometry Property is that it
can be checked quickly (in $p(p-1)/2$ operations), whereas no one knows how to check the RIP without
enumerating all possible supports $S \subset \{1, \ldots, n\}$ with cardinal $s'$. Such an enumeration would of
course take an exponential amount of time. The main relationship between IC and RIP
is that it can be proved that under IC, (1.2) holds, not for all, but for most supports $S \subset \{1, \ldots, n\}$
with cardinal $s'$, where $s' \le C_s\, p/(\|X\|^2 \log(p))$, for some constant $C_s$ controlling the proportion of
such supports.
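
Since exhaustive enumeration of supports is out of reach, one can at least gather empirical evidence about "most supports" by sampling. The sketch below is only a Monte Carlo illustration, not a certificate of the RIP; the function name `singular_value_range`, the number of sampled supports and the seed are assumptions.

```python
import numpy as np

def singular_value_range(X, s_prime, n_supports=200, seed=0):
    """Smallest and largest singular values of X_S over randomly sampled supports S."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    lo, hi = np.inf, 0.0
    for _ in range(n_supports):
        S = rng.choice(p, size=s_prime, replace=False)  # uniform support of size s'
        sv = np.linalg.svd(X[:, S], compute_uv=False)   # sorted in decreasing order
        lo, hi = min(lo, sv[-1]), max(hi, sv[0])
    return lo, hi
```

If both returned values are close to one, the sampled submatrices $X_S$ are nearly isometric; nothing is implied about the supports that were not sampled.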

The objective of the present paper is to extend the analysis based on the Incoherence Condition to
more general situations where $X$ may have many highly collinear columns. The main idea is to assume
that the columns are drawn from a mixture model of $K$ clusters, and that the cluster centers
form a matrix which satisfies the Incoherence Condition.

2. Main results 

2.1. The mixture model. In order to relax the Incoherence Condition, one needs a model for the
design matrix $X$ allowing for a certain amount of correlation between columns while keeping some of
the algebraic structure, in the same spirit as (1.2), for at least most supports indexing a subset of really
pertinent covariates. In what follows, we study such a model, where the columns can be considered
as belonging to a family of clusters and the cluster centers (or an empirical surrogate) are defined to
be the pertinent variables. This model is of great interest when many columns are highly collinear. In
practice, one often observes that the columns of $X$ can be grouped into different clusters such that
the dot product of $X_j$ and $X_{j'}$, $j \neq j'$, is close to one if they belong to the same cluster, and very
close to zero otherwise. Notice that applying the LASSO to such designs will eventually result in
grossly incorrect variable selection. On the other hand, confusing a variable with another highly correlated
variable might not be a real issue as far as prediction is concerned if the clusters are well separated.

2.1.1. Detailed presentation. Let $K$ be the number of clusters in the covariates. Consider a matrix $\mathfrak{C}$
in $\mathbb{R}^{n \times K}$ with small coherence. The columns $\mathfrak{C}_k$, $k = 1, \ldots, K$, of the matrix $\mathfrak{C}$ will be the
"centers" of the clusters.

The design matrix will be assumed to derive from a matrix $X_o$ whose columns are drawn according to the
following procedure. Let $\mathcal{K}$ be randomly drawn among all index subsets of $\{1, \ldots, K\}$ with cardinal
$s^*$, with uniform distribution. We then assume that, conditionally on $\mathcal{K}$, each column of $X_o$ is drawn
from a mixture $\Phi$ of $K$ $n$-dimensional Gaussian distributions, i.e.

$\Phi(x) = \sum_{k \in \mathcal{K}} \pi_k \phi_k(x)$,

where

$\phi_k(x) = \frac{1}{(2\pi\mathfrak{s}^2)^{n/2}} \exp\left(-\frac{\|x - \mathfrak{C}_k\|_2^2}{2\mathfrak{s}^2}\right)$

and $\pi_k \ge 0$, $k \in \mathcal{K}$, and $\sum_{k \in \mathcal{K}} \pi_k = 1$. We will denote by $n_k$ the random number of columns in $X_o$
that were drawn from $\mathcal{N}(\mathfrak{C}_k, \mathfrak{s}^2 \mathrm{Id})$, $k = 1, \ldots, K$. Thus, $\sum_{k \in \mathcal{K}} n_k = p$.

Finally, the matrix $X$ is obtained by column-wise normalization of $X_o$, i.e. $X_j = X_{o,j}/\|X_{o,j}\|_2$.
Notice that the model could easily be modified in order to accommodate more general distributions for $\mathcal{K}$ than
the uniform distribution on subsets of $\{1, \ldots, K\}$ with cardinal $s^*$.
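
The sampling procedure just described can be simulated in a few lines. The sketch below is only an illustration under simplifying assumptions: the mixture weights are taken uniform over the selected clusters, and the names `sample_mixture_design`, `centers`, `sd` and `seed` are hypothetical.

```python
import numpy as np

def sample_mixture_design(centers, s_star, p, sd, seed=0):
    """Draw a design from the Gaussian mixture model, assuming uniform mixture weights."""
    rng = np.random.default_rng(seed)
    n, K = centers.shape
    # Random index set of s_star clusters, uniform among subsets of {0, ..., K-1}.
    active = rng.choice(K, size=s_star, replace=False)
    # Cluster label of each column (uniform weights over the active clusters).
    labels = rng.choice(active, size=p)
    # Unnormalized design: center of the cluster plus isotropic Gaussian deviation.
    X0 = centers[:, labels] + sd * rng.standard_normal((n, p))
    # Column-wise normalization.
    X = X0 / np.linalg.norm(X0, axis=0, keepdims=True)
    return X, X0, labels, active
```

With a small `sd`, columns sharing a label have inner products close to one. Note that with few columns some selected clusters may receive no column at all, a situation that the assumptions on the cluster sizes rule out.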

2.1.2. More notations. For each $j \in \{1, \ldots, p\}$, denote by $k_j$ the index of the Gaussian component
from which column $j$ was drawn, and let $J_k$ denote the set of indices of the columns drawn from the
$k$-th Gaussian component. For any index set $S \subset \{1, \ldots, p\}$, let $\mathcal{K}_S$ denote the list (with possible
repetitions)

$\mathcal{K}_S = \{k_j \mid j \in S\}$.

The deviation of column $X_{o,j}$ from its center will be denoted by

$E_j = X_{o,j} - \mathfrak{C}_{k_j}$, where $X_{o,j} \sim \mathcal{N}(\mathfrak{C}_{k_j}, \mathfrak{s}^2 \mathrm{Id})$,

and the matrix $E$ is defined as

$E = (E_{i,j})_{i \in \{1, \ldots, n\},\; j \in \{1, \ldots, p\}}$.

2.2. A simple proxy for $\beta$. For each $k \in \{1, \ldots, K\}$, let $X_{j^*_k}$ be the best approximation of the center
$\mathfrak{C}_k$ from the set of columns $X_j$, $j \in J_k$, i.e.

$j^*_k = \mathrm{Argmin}_{j \in J_k} \|X_j - \mathfrak{C}_k\|_2$.

Moreover, set

$T^* = \{j^*_k \mid k \in \mathcal{K}\}$.

Of course, we have $s^* = |T^*|$.
The vector $\beta^*$ is defined by

(2.3) $\mathfrak{C}_{\mathcal{K}_{T^*}} \beta^*_{T^*} = \mathfrak{C}_{\mathcal{K}_T} \beta_T$.

A simple expression of $\beta^*$ can be obtained by taking

(2.4) $\beta^*_{j^*_k} = \sum_{j \in J_k} \beta_j$

for all $j^*_k \in T^*$. Moreover, this expression is unique whenever $X_{T^*}$ has rank equal to $s^*$. In Section 3,
we will show that $X_{T^*}$ is indeed non-singular with high probability under appropriate assumptions on
T.
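
The construction of the proxy can be sketched as follows, reading (2.4) as the sum of the coefficients $\beta_j$ over each cluster, which is an interpretation and not a statement taken verbatim from the text. The function name `proxy` and its arguments (`labels` holding the cluster index $k_j$ of each column) are assumptions.

```python
import numpy as np

def proxy(X, centers, labels, beta):
    """Return the representative columns T* and the proxy vector beta*."""
    beta_star = np.zeros_like(beta, dtype=float)
    T_star = []
    for k in np.unique(labels):                     # clusters that own at least one column
        J_k = np.flatnonzero(labels == k)           # columns drawn from cluster k
        dists = np.linalg.norm(X[:, J_k] - centers[:, [k]], axis=0)
        j_star = J_k[np.argmin(dists)]              # column closest to the center
        beta_star[j_star] = beta[J_k].sum()         # assumed reading of (2.4)
        T_star.append(j_star)
    return np.array(T_star), beta_star
```

By construction `beta_star` is supported on `T_star`, with one nonzero entry at most per cluster.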



2.3. Main result.
2.3.1. Further notations. In the sequel, $r$ will denote a constant in (1, 1/4). The constants $\vartheta^*$ and $\nu$ will
be specified in Assumption 2.3 below. The constants $C_\mu$, $C_{spar}$ and $C_{col}$ will be used in the assumptions
below:

$C_\mu = r/(1+\alpha)$,
$C_{spar} = r^2/((1+\alpha)e^2)$,



Let C x denote a positive constant such that 



2 \ /„,2\ 



2 / „.2 



< < C v — 



V s / \ n 

where $G$ is an $n$-dimensional centered and unit-variance i.i.d. Gaussian vector. Let us further define



(2.5) $r_{\max} = 1 + \mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)$,



and 



c c 



(2.6) /i max = - s yy/n + y/s + \j2a\og (p) 



(2-7) cr 2 mSLX = 



1 -*\/" l £i ^ x 11 )' 1 



(2-8) if 2 * = a n Iog(p) 



a (1 - e" 1 )^ " / 1 



i 



/ a. (1— e \ n ( i \n 

(2.9) V m J _ l^M^ L^VF^ 

i-«/»r^)v^= 
2.3.2. Assumptions. We will make the following assumptions.
Assumptions 2.1. 

. r e 2-io g ( Q ) -i 

p > max < K , e > . 

and 

'0.2 -r (1 + 1.1 -r + 0.11 -r 2 ) 1.1 -r (1.1 + 0.11 • r) 1 



log(p) > max 



0.1 • (1.1 -r + 0.11 -r 2 ) ' a 



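Assumptions 2.2. Assume that $\mathfrak{C}$ has coherence $\mu(\mathfrak{C})$ satisfying

(2.10) $\mu(\mathfrak{C}) \le \frac{C_\mu}{\log(p)}$.

Assumptions 2.3. There exists a positive real constant $\vartheta^*$ and a positive integer $\nu$ such that

$\min_{k \in \mathcal{K}} |J_k| \ge \vartheta^* \log(p)^\nu$.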

Assumptions 2.4. 

^* ^ -^Q) C S p ar 
~~ logp ||£|| 2 

Assumptions 2.5.

(2.11) $n \ge \frac{\alpha + 1}{c} \log(p)$.

Remark 2.1. Assumption 2.5 is to be interpreted with care since the order of magnitude of $n$ is
primarily governed by Assumption 2.2 on the coherence of $\mathfrak{C}$. For instance, if $\mathfrak{C}$ comes from a Gaussian
i.i.d. random matrix, the coherence will be of the order $\sqrt{\log(K)/n}$, as discussed in [9, Section 1.1],
and $n$ should be at least of the order $\log(p)^2 \log(K)$. Notice that this is still less than if $X$ itself had
to satisfy the coherence bound, which would imply that $n$ be of the order $\log(p)^3$.

Assumptions 2.6. 

Ccol > e 2 (a + l) nicix{ \J C spar •) (-^fj, } • 

and 



(c^+a+Li.,)^) < 1 / **<p)o--'") 2 



2V (alog(p)-bg(2))2' 



Assumptions 2.7. s U/n + y §log(p) + ^log(s)J < 1/2 
(2.12) 5 < C s ' 



y/^gip) (^+ A / £ ^logb)) 
for any C B)n , p such that and 



C s , n ,p < min^O.l- . = ;-ylog(pj 



W. C x ) Vlog(p) 
Assumptions 2.8. 

2 a log (p)n of 



. < j - • - max 



^T" MjLx lo g 2 (p) - 12 C/ /'max ^max S\/™ 

Assumptions 2.9. 

ii «* n2 ^ 2a log (p) n cr max 2 



T 1 * 112 > 



Mmax 2 log 2 (p) - 24 ^ max r max S * / ^Ml^i 



/ f'max ' max « V ^ 0» C x J \ log(p) 
Remark 2.2. Notice that, using (2.4), the following relationship holds between $\|\beta^*_{T^*}\|_2$ and $\|\beta_T\|_2$ when
all the coefficients $\beta_j$, $j \in J_k$, have the same sign for all $k \in \mathcal{K}$:

$\|\beta^*_{T^*}\|_2 \ge \|\beta_T\|_2$.

In this case, one can replace $\|\beta^*_{T^*}\|_2$ by $\|\beta_T\|_2$ in Assumption 2.9 and merge Assumption 2.8
and Assumption 2.9 by taking the maximum of their respective right hand sides to obtain a simpler
assumption.

Assumptions 2.10. The support of $\beta^*$ is random and uniformly distributed among subsets of
$\{1, \ldots, p\}$ with cardinal $s^*$. The sign of $\beta^*_{T^*}$ is random with uniform distribution on $\{-1, 1\}^{s^*}$.

Remark 2.3. This last assumption is a transposition to the proxy $\beta^*$ of the conditions on $\beta$ in [9].

2.3.3. Main theorem. The main result of this paper is the following theorem. 



Theorem 2.4. Set $\lambda = 2\sigma\sqrt{2\alpha\log(p)}$. Assume that $X$ is drawn from the Gaussian mixture model
of Section 2.1 with $\mathcal{K}$ drawn uniformly at random among all possible index subsets of $\{1, \ldots, K\}$ with
cardinal s* . Let Assumvtions[2l[ [Ql fO HO [Ol [221 [23 [23 and\2M hold. Then, we have 



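$\frac{1}{2}\|Xh\|_2^2 \le s^*\, r^*\, \frac{3}{2}\lambda\left(\frac{3}{2}\lambda + \sqrt{1 + r^*}\;\delta\,\|\mathfrak{C}_{\mathcal{K}_T}\beta_T\|_2\right) + \frac{1}{2}\delta^2\|X\beta\|_2^2$
with $r^* = 1.1\, r\,(1.1 + 0.11\, r)$ and for any $\delta$ satisfying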



5 > 4s \^/n~ + y ^ log(p) + - log(s)J (l + 8\/2 v/alogfr) + log(2n + 2)^/^7^) 
+ ^12 Cj 5 V» r max + alogO) /i max ^ a/ s*Pe; 

(2.13) +( a c s /SWypr c ' +,,: - ° io8w ) m 

3. Proof of Theorem 2.4

Some parts of the proof closely follow the key arguments in the proof of [9, Theorem 1.2]. Their
adaptation to the present setting is however sometimes nontrivial. We present all the details for the
sake of completeness.

3.1. Preliminaries: Candès and Plan's conditions. The following proposition will be used repeatedly
in the arguments.

Proposition 3.1. We have the following properties: 
(I) 

(344) p - id s n > ri < 2i<i 



2 J ~ p a 
(II) 

(3.15) P(||X^,Xr* > 1.1 t(1.1 + 0.11 t)) < 
(HI) 

(3.16) w(\\X t z\\ 00 >o^2a log(p)) < ^. 



219 
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(IV) 

||Xy*cXr*(Xy*Xr0 _1 ^T* z || oo + A \\X^ t cXT*(X^XT*)~ 1 sgn(f5^*)\\ oo 



(3.17) < ay/1 + 1.1 • r(l.l + 0.11 • rj + - A 

Proof. See Appendix O □ 
3.2. Controlling $\|X\beta - X\beta^*\|_2$ by $\|X\beta\|$.
Proposition 3.2. One has 

F(\\xp-X/3*\\ 2 >5 \\CKrPrh) < \ 

pa 

Proof. The proof is divided into four steps, for the sake of clarity.
Step 1. Let

$E_T = X_T - \mathfrak{C}_{\mathcal{K}_T}$,

where, since $\mathcal{K}_T$ is supposed to be a list with possible repetitions, the matrix $\mathfrak{C}_{\mathcal{K}_T}$ correspondingly has
possible column repetitions, and

$E_{T^*} = X_{T^*} - \mathfrak{C}_{\mathcal{K}_{T^*}}$.

Thus, using (2.3),

$\|X\beta - X\beta^*\|_2 = \|E_T\beta_T - E_{T^*}\beta^*_{T^*}\|_2$,
which, by the triangle inequality, gives

$\|X\beta - X\beta^*\|_2 \le \|E_T\beta_T\|_2 + \|E_{T^*}\beta^*_{T^*}\|_2$.
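Step 2: Control of $\|E_T\beta_T\|_2$. The column $j \in T$ of the matrix $E_T$ has the expression

$E_{T,j} = \frac{\mathfrak{C}_{k_j} + E_j}{\|\mathfrak{C}_{k_j} + E_j\|_2} - \mathfrak{C}_{k_j}$.

We may bound the quantity $\|E_T\beta_T\|_2$ as

$\|E_T\beta_T\|_2 \le \|A\|_2 + \|B\|_2$,

where

$A = \sum_{j \in T}\left(\frac{1}{\|\mathfrak{C}_{k_j} + E_j\|_2} - 1\right)\beta_j\,\mathfrak{C}_{k_j}$

and

$B = \sum_{j \in T}\frac{1}{\|\mathfrak{C}_{k_j} + E_j\|_2}\,E_j\,\beta_j$.

We have the following bound for $A$.
Lemma 3.3.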

^P|| 2 >4s (^/7i + log(p) + i log(s) j (l + Sv^ >/alog(p) + log(2n + 2) v^V^) \\£k t Pt\ 



(3.18) < C ' + J 



Proof. See Appendix A.1. □
Turning to B, we have the following result. 



Lemma 3.4. We have

P^||-B|| 2 > ^12 Cj s r rs 
Proof. See Appendix A.2. □



+ alog{p) /i max j y^Vcll^/CT^Tlb^ < 



STEPHANE CHRETIEN 

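Step 3: Control of $\|E_{T^*}\beta^*_{T^*}\|_2$. The column $j^* \in T^*$ of the matrix $E_{T^*}$ has the expression

$E_{T^*,j^*} = \frac{\mathfrak{C}_{k_{j^*}} + E_{j^*}}{\|\mathfrak{C}_{k_{j^*}} + E_{j^*}\|_2} - \mathfrak{C}_{k_{j^*}}$.

We will proceed as in Step 2. Define

$W_{j^*} = \frac{1}{\|\mathfrak{C}_{k_{j^*}} + E_{j^*}\|_2} - 1$.

Notice that $E_{T^*}\beta^*_{T^*}$ can be written

$E_{T^*}\beta^*_{T^*} = A^* + B^*$

with

$A^* = \sum_{j^* \in T^*} W_{j^*}\,\beta^*_{j^*}\,\mathfrak{C}_{k_{j^*}}$

and

$B^* = \sum_{j^* \in T^*} \beta^*_{j^*} B_{j^*}$, where $B_{j^*} = \frac{E_{j^*}}{\|\mathfrak{C}_{k_{j^*}} + E_{j^*}\|_2}$.

We begin with the study of $A^*$.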
Lemma 3.5. We have 



\A*\\o > 4s\ n 



a(l-e- 1 )\™/ 1 



\og(p) u 1 



< 

- p a 

Proof. See Appendix A.3.

Turning to B*, we have the following result. 
Lemma 3.6. We have 



IB* II > 24 r* s 

|_D II ^ I Z,*± / max 3 



1 + 2^/2 ^v / alog(p)+log(2n + 2))J \\€ Kt /3 t \\ 2 
2 



a (1 — e x ) \ " 



1 



+/4ax alog(p) ) y/pC \\£tPt\\2 < 



log(p) u 1 
2 



Proof. See Appendix A.4.



□ 



□ 



Step 4: Conclusion. Combining Lemmas 3.3, 3.4, 3.5 and 3.6, we obtain that for any $\delta$ such that
(2.13) holds, we have

1 



PQ\XP-XP*\\2>6\\€tPt\\2) < — • 



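□
3.3. The prediction bound. By definition, the LASSO estimator satisfies

(3.19) $\frac{1}{2}\|y - X\hat\beta\|_2^2 + \lambda\|\hat\beta\|_1 \le \frac{1}{2}\|y - X\beta^*\|_2^2 + \lambda\|\beta^*\|_1$.

One may introduce $X\beta$ in this expression and obtain

$\frac{1}{2}\|y - X\beta + X(\beta - \hat\beta)\|_2^2 + \lambda\|\hat\beta\|_1 \le \frac{1}{2}\|y - X\beta + X(\beta - \beta^*)\|_2^2 + \lambda\|\beta^*\|_1$,
from which we deduce

(3.20) $\frac{1}{2}\|X(\hat\beta - \beta)\|_2^2 \le \langle y - X\beta, X(\hat\beta - \beta^*)\rangle + \lambda\left(\|\beta^*\|_1 - \|\hat\beta\|_1\right) + \frac{1}{2}\|X(\beta - \beta^*)\|_2^2$.

Set $h^* := \hat\beta - \beta^*$. Using sparsity of $\beta^*$, we obtain that $h^*_{T^{*c}} = \hat\beta_{T^{*c}} - \beta^*_{T^{*c}} = \hat\beta_{T^{*c}}$. Thus, we have

$\|\hat\beta\|_1 - \|\beta^*\|_1 = \|\beta^* + h^*\|_1 - \|\beta^*\|_1$
$= \|\beta^*_{T^*} + h^*_{T^*}\|_1 + \|\beta^*_{T^{*c}} + h^*_{T^{*c}}\|_1 - \|\beta^*_{T^*}\|_1$
$= \|\beta^*_{T^*} + h^*_{T^*}\|_1 - \|\beta^*_{T^*}\|_1 + \|h^*_{T^{*c}}\|_1$.

Since, for any $b$ with no zero component, the gradient of $\|\cdot\|_1$ at $b$ is $\mathrm{sgn}(b)$, the subgradient inequality
gives

$\|\beta^*_{T^*} + h^*_{T^*}\|_1 \ge \|\beta^*_{T^*}\|_1 + \langle \mathrm{sgn}(\beta^*_{T^*}), h^*_{T^*}\rangle$
and combining this latter inequality with (3.20), we obtain

(3.21) $\frac{1}{2}\|X(\hat\beta - \beta)\|_2^2 \le \langle y - X\beta, Xh^*\rangle - \lambda\langle \mathrm{sgn}(\beta^*_{T^*}), h^*_{T^*}\rangle - \lambda\|h^*_{T^{*c}}\|_1 + \frac{1}{2}\|X(\beta - \beta^*)\|_2^2$.

Set $r := \beta^* - \beta$ and $h := \hat\beta - \beta$. Using these notations, equation (3.21) may be written

(3.22) $\frac{1}{2}\|Xh\|_2^2 \le \langle z, Xh^*\rangle - \lambda\langle \mathrm{sgn}(\beta^*_{T^*}), h^*_{T^*}\rangle - \lambda\|h^*_{T^{*c}}\|_1 + \frac{1}{2}\|Xr\|_2^2$.

Using the fact that

$\langle X^t z, h^*\rangle = \langle X^t_{T^*} z, h^*_{T^*}\rangle + \langle X^t_{T^{*c}} z, h^*_{T^{*c}}\rangle$
and the following majorization based on (3.16),

$\langle X^t_{T^{*c}} z, h^*_{T^{*c}}\rangle \le \|h^*_{T^{*c}}\|_1\,\|X^t_{T^{*c}} z\|_\infty \le \frac{1}{2}\lambda\,\|h^*_{T^{*c}}\|_1$,

we obtain that

$\frac{1}{2}\|Xh\|_2^2 \le \langle v, h^*_{T^*}\rangle - \left(1 - \frac{1}{2}\right)\lambda\,\|h^*_{T^{*c}}\|_1 + \frac{1}{2}\|Xr\|_2^2$,

where $v := X^t_{T^*} z - \lambda\,\mathrm{sgn}(\beta^*_{T^*})$.
Now, observe that

$\langle v, h^*_{T^*}\rangle = \langle v, (X^t_{T^*}X_{T^*})^{-1}X^t_{T^*}X_{T^*}h^*_{T^*}\rangle$
$= \langle (X^t_{T^*}X_{T^*})^{-1}v, X^t_{T^*}X_{T^*}h^*_{T^*}\rangle$
$= \underbrace{\langle (X^t_{T^*}X_{T^*})^{-1}v, X^t_{T^*}Xh^*\rangle}_{A_1} - \underbrace{\langle (X^t_{T^*}X_{T^*})^{-1}v, X^t_{T^*}X_{T^{*c}}h^*_{T^{*c}}\rangle}_{A_2}$.

Let us begin by studying $A_2$. We have that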

A 2 > -||X^ c X T *(X^X T ,)-^||oo||/iT«||l 

> — \\Xt*cXt* {Xt* Xt*)^ 1 X?* zWooWh^cWi 

-A ||X^cX T .(^Jf T .)- 1 sgn(/3V) lUII/iVHIi 

> - r<r-s/l + 1-1 • r (1-1 + 0.11 - r) + hj ||/i*tHIi 

by ff37T7T) . Thus 

(u,/iV) < + (W 1 + 1-1 • r (1.1 + 0.11 • r) + i A J ||/ioHli 
and since, by (|2.10|) . 

(a^l + 1.1 • r (1.1 + 0.11 T) + hj < A, 



we deduce that 



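$\frac{1}{2}\|Xh\|_2^2 \le A_1 + \frac{1}{2}\|Xr\|_2^2$.

Let us now bound $A_1$ from above. We have that

$A_1 \le \underbrace{\|X^t_{T^*}Xh^*\|_\infty}_{S_1}\;\underbrace{\|(X^t_{T^*}X_{T^*})^{-1}v\|_1}_{S_2}$.

Firstly,

$S_1 \le \|X^t_{T^*}(X\beta^* - y)\|_\infty + \|X^t_{T^*}(X\hat\beta - y)\|_\infty$
$\le \|X^t_{T^*}(Xr - z)\|_\infty + \|X^t_{T^*}(y - X\hat\beta)\|_\infty$
$\le \frac{3}{2}\lambda + \|X^t_{T^*}Xr\|_\infty$,

where we used (3.16) and the optimality condition for the LASSO estimator. Secondly,

$S_2 \le \sqrt{s^*}\,\|(X^t_{T^*}X_{T^*})^{-1}\|\,\|v\|_2$
$\le s^*\,\|(X^t_{T^*}X_{T^*})^{-1}\|\,\|v\|_\infty$.

Moreover, (3.15) gives $\|(X^t_{T^*}X_{T^*})^{-1}\| \le 1.1\, r\,(1.1 + 0.11\, r)$ and

$\|v\|_\infty \le \|X^t_{T^*}z\|_\infty + \lambda \le \frac{3}{2}\lambda$.

Thus, we obtain that

$A_1 \le s^*\, 1.1\, r\,(1.1 + 0.11\, r)\,\frac{3}{2}\lambda\left(\frac{3}{2}\lambda + \|X^t_{T^*}Xr\|_\infty\right)$

and thus,

$\frac{1}{2}\|Xh\|_2^2 \le s^*\, 1.1\, r\,(1.1 + 0.11\, r)\,\frac{3}{2}\lambda\left(\frac{3}{2}\lambda + \|X^t_{T^*}Xr\|_\infty\right) + \frac{1}{2}\|Xr\|_2^2$.

Since $\|X^t_{T^*}Xr\|_\infty \le \|X^t_{T^*}Xr\|_2$ and since $\|X^t_{T^*}Xr\|_2 \le \sqrt{1 + 1.1\, r\,(1.1 + 0.11\, r)}\,\|Xr\|_2$, we obtain

$\frac{1}{2}\|Xh\|_2^2 \le s^*\, 1.1\, r\,(1.1 + 0.11\, r)\,\frac{3}{2}\lambda\left(\frac{3}{2}\lambda + \sqrt{1 + 1.1\, r\,(1.1 + 0.11\, r)}\,\|Xr\|_2\right) + \frac{1}{2}\|Xr\|_2^2$.
Moreover, Proposition 3.2 yields

$\frac{1}{2}\|Xh\|_2^2 \le s^*\, 1.1\, r\,(1.1 + 0.11\, r)\,\frac{3}{2}\lambda\left(\frac{3}{2}\lambda + \sqrt{1 + 1.1\, r\,(1.1 + 0.11\, r)}\;\delta\,\|\mathfrak{C}_{\mathcal{K}_T}\beta_T\|_2\right) + \frac{1}{2}\delta^2\|X\beta\|_2^2$,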
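which completes the proof.
Appendix A. Technical lemmæ
A.1. Proof of Lemma 3.3. We have that

$\|\mathfrak{C}_{k_j}\|_2 - \|E_j\|_2 \le \|\mathfrak{C}_{k_j} + E_j\|_2 \le \|\mathfrak{C}_{k_j}\|_2 + \|E_j\|_2$.
Moreover, since $\|E_j\|_2^2/\mathfrak{s}^2$ follows the $\chi^2_n$-distribution, the scalar Chernoff bound gives

(A.23) $P\left(\frac{\|E_j\|_2}{\mathfrak{s}} \ge \sqrt{n} + u\right) \le C\exp(-cu^2)$

for some constants $c$ and $C$. Let $W_j$ denote the following variable:

$W_j = \frac{1}{\|\mathfrak{C}_{k_j} + E_j\|_2} - 1$,

and let $\mathcal{E}_\alpha$ denote the event

$\mathcal{E}_\alpha = \bigcap_{j \in T}\left\{-\frac{\mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)}{1 + \mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)} \le W_j \le \frac{\mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)}{1 - \mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)}\right\}$.

Taking $u = \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}$, we obtain that

$P(\mathcal{E}_\alpha) \ge 1 - \frac{C}{p^\alpha}$.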



On the other hand, we can write \\A\\2 as 



\A\ 



where Aj is the matrix 



A d = Wj 



o 



Thus, by the triangle inequality, we have

(A.24) $\|A\|_2 \le \left\|\sum_{j \in T} A_j - E[A_j \mid \mathcal{E}_\alpha]\right\| + \left\|\sum_{j \in T} E[A_j \mid \mathcal{E}_\alpha]\right\|$

and we may apply the Matrix Hoeffding inequality recalled in Appendix B.2. We have that



\Wj\\P 3 \ 



which implies that, on £ a , we have 



\Aj-E [Aj | S a ] || < 2 ^ V . ==^\p ju 



1-5 (v^+^f log(p) + i log(s)) 



which, by Assumption 2.7, gives that



|.l,-E[.l,- |£,]|| < 4s ( \/w + log(p) + ^ log(s) j |/3d. 
The Matrix Hoeffding inequality, first applied to the sum and then to its opposite, yields



J^Aj -E[Aj | £ a ] h>u \ £ 

( 

(A.25) < 2(n + 1) • exp 



u 



\ 



8 -16 s 2 (^+^tlog(p) + Mog( S )) ll^Hiy 



On the other hand, we have that 

||1>[^I£J 



fyj 



Notice that, since $\|\mathfrak{C}_{k_j}\|_2 = 1$, then $\|\mathfrak{C}_{k_j} + E_j\|_2^2/\mathfrak{s}^2$ has a noncentral-$\chi^2$ distribution with non-centrality
parameter equal to $1/\mathfrak{s}^2$, for all $j \in T$. Thus, we deduce that all the variables $\|\mathfrak{C}_{k_j} + E_j\|_2$, $j \in T$,
have the same distribution and, in particular, the same conditional expectation given $\mathcal{E}_\alpha$. Therefore,



5>[A,-|5 Q ] 



mwi\£ a ]\ ^ 

|E[Wi |£ a ]| ||^ t /3t|| 2 . 



<4/, 



But since 



we obtain 



|E [IT, | <5,,] £ la j v^ + \/^ log(p) + ilog(s) 



4s vn 



Combining this latter inequality with (A.25), (A.24) becomes



a 1 \ 

-Iog(p ) + -log s ||£x: T ^r||2. 

C CI 



2 > Is ( A + log(p) + ^ log(s) j ||^ T /3 T || 2 + u I 5 Q 



/ 



< 2(n + 1) • exp 



\ 



Since, for any event A, 



\ 8 -16 s 2 ^/n + y?log(p) + ilogOOJ ||#r||I/ 
P(Vl) < P(A|f a )+P(fS). 



we obtain that 



2 > is J v '/> + \/ - log(p) + i log(s) j ||£x: t ^t||2 + u 



< 2(n + 1) • exp 



Let us now choose u such that 

/ 

2(n + 1) • exp 



I 

u 

^ 8 -16 s 2 (v^+Vf log(p) + ilog( S )) p T |ll / 



IT 



^ 8 -16 s 2 U/n + ^/f log(p) + ilog( S )J ||/3 T || 2 y 
i.e.



v 



8^2s Lfti + yj^ log(p) + - c log(s)^ ||/3 T || 2 V«log(p) + log(2n + 2)). 



Therefore, we obtain that 



- lib > Ls ( v + log(p) + ~ c log(s) j f \\<tK T PTh 



(A.26) < 



+8^2 Valog(p) + log(2n + 2)||/3 T || 2 
C + l 



Recall that we assumed the $\beta_j$ associated to the same cluster to have the same sign. Thus, we obtain
that

$\|\beta_T\|_1 = \|\beta^*_{T^*}\|_1 \le \sqrt{s^*}\,\|\beta^*_{T^*}\|_2$,
and using the version of the Invertibility Condition for $\mathfrak{C}$ (3.14), we get



PtIIi = V^Pz \\£k* t Pt42, 

and thus, 

||#r||2 = s/s*p~z \\£K* T pT*h 

and, using the definition of 0t* , 



(A.27) ||/3 T || 2 = yft*Pe WCKrPrh- 

Thus, (A.26) gives



^||A|| 2 > 4s ^Vn + ^~ log(p) + ~ log(a) j (1 + 8^2 Valog(p) + log(2n + 2) V / P^) ||<£jc t /3t| 



(A.28) < C ' +J 



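A.2. Proof of Lemma 3.4. Recall that

$\|B\|_2 = \left\|\sum_{j \in T}\frac{\beta_j}{\|\mathfrak{C}_{k_j} + E_j\|_2}\,E_j\right\|_2$.

We will use Talagrand's concentration inequality and Dudley's entropy integral bound to study $\|B\|_2$.
We start with some preliminary results.

A.2.1. Preliminaries. Let us define the following event:

$\mathcal{F}_\alpha = \mathcal{E}_\alpha \cap \left\{\|E_T\| \le \mathfrak{s}\left(\sqrt{n} + \sqrt{s} + \sqrt{2\alpha\log(p)}\right)\right\}$.
Since $E_T$ is i.i.d. with Gaussian entries $\mathcal{N}(0, \mathfrak{s}^2)$, Theorem B.5 in Appendix B.4 gives

$P\left(\|E_T\| \ge \mathfrak{s}\left(\sqrt{n} + \sqrt{s} + \sqrt{2\alpha\log(p)}\right)\right) \le \frac{1}{p^\alpha}$.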
Thus, the union bound gives that P (J- a ) > 3/p a . Let us now turn to the task of bounding ||-B||2- 
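
The operator norm bound used in the definition of the event above is easy to probe numerically. The sketch below draws one Gaussian matrix with i.i.d. centered entries of standard deviation `sd` and compares its largest singular value with the threshold $\mathfrak{s}(\sqrt{n} + \sqrt{s} + \sqrt{2\alpha\log(p)})$; the function name `operator_norm_check` and all parameter names are assumptions, and a single draw only illustrates the bound, it does not prove it.

```python
import numpy as np

def operator_norm_check(n, s_card, sd, p, alpha, seed=0):
    """Return (operator norm of an n x s_card Gaussian matrix, high-probability threshold)."""
    rng = np.random.default_rng(seed)
    E = sd * rng.standard_normal((n, s_card))        # i.i.d. N(0, sd^2) entries
    bound = sd * (np.sqrt(n) + np.sqrt(s_card) + np.sqrt(2 * alpha * np.log(p)))
    return np.linalg.norm(E, 2), bound               # spectral norm and threshold
```

Repeating the draw with different seeds gives an empirical idea of how conservative the threshold is for given values of `n`, `s_card` and `p`.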
A.2.2. Concentration of $\|B\|_2$ using Talagrand's inequality. Notice that on $\mathcal{F}_\alpha$, we have

ft 



-Ej 

jeT Mi + E jh 



< max 
b 




2 


jeT 



where the maximum is over all 
be 



1-5 I v / n + W-log(p) + -log(s) I ,1 +s I Vn + W-log(p) + -log(s) 



a 



Thus, on J^, 



|5[| 2 < max <>, y^ ^-Eij), 

0, IB 2=1 T - 



the main advantage of this former inequality being that of involving the supremum of a simple Gaussian 
process. Now, we have 



\Bh-E 



< 



, w 2=1 r - ^, o 



J6T 



> It I Tn 



max (w, 4^ 
b,|M| 2 =i ^— ^ 6 • 



E 



jeT 



Let 



.1/, 



max (w, V] x^i) I 

,11-112 = 1 ^ 6 



> u I Jv, 



ier 



In order to apply Talagrand's concentration inequality, we have to bound $M_{b,w}$ on $\mathcal{E}_\alpha$, and its
conditional variance given $\mathcal{F}_\alpha$. First, by the Cauchy-Schwarz inequality, we have



and thus, 



jeT 



M b , w < -||/3 T || 2 ||£;^|| 2 



< J| 



Et\\\\w\ 



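Thus, on $\mathcal{F}_\alpha$, using the fact that $\|w\|_2 = 1$, we have

$\max_{b,\,\|w\|_2 = 1} M_{b,w} \le \mu_{\max}\,\|\beta_T\|_2$,

where $\mu_{\max}$ is given by (2.6). Let us now turn to the conditional variance of $M_{b,w}$ given $\mathcal{F}_\alpha$. We have

$\mathrm{Var}(M_{b,w} \mid \mathcal{F}_\alpha) = \sum_{j \in T}\frac{\beta_j^2}{b^2}\,\mathrm{Var}(E_j^t w \mid \mathcal{F}_\alpha)$,

and, using the Cauchy-Schwarz inequality again, we obtain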

Wfrh 



Var {M b , w | F a ) 



Var 2 {Ejw \ F a 
jeT 



On the other hand, notice that, conditionally on $\mathcal{F}_\alpha$, $E_j^t w$ is centered. This can easily be seen from the
invariance of both the Gaussian law and the event $\mathcal{F}_\alpha$ under the action of orthogonal transformations.
Therefore, we have



Var(£» > Var (Ejw \ F a ) (l - A) . 
Moreover, using the fact that $\|w\|_2 = 1$,

$\mathrm{Var}(E_j^t w) = \mathfrak{s}^2$.

Therefore,






Var {M b>w | T a ) 
Using the lower bound on b, we finally obtain 



■>T 2- 



Var (M fei „, | F a ) < 

'-''max 1 1 Pt 1 1 2 > 

where $\sigma_{\max}$ is defined by (2.7). With the bound on $M_{b,w}$ and its conditional variance in hand, we are
ready to use Talagrand's concentration inequality recalled in Appendix B.5. Thus, Theorem B.6 gives



max 

,IM|2=1 A'max ||/#T||2 



> E 



AT*, 



(A.29) 
(A.30) 
with 



7 = n 



Aax ll^rlli 



+ E 



max 

6,||w||2=l ^max ||/*T||2 

< exp (— u) , 



max 

,||w||2=l /^max ||/*T||2 



A. 2. 3. Control of the conditional expectation of maxfo j || u ,|| 2= i — 



I J a 

Notice that 



E 



max (w, ^-Ej 

,11 112 



> E 



max (»,^^)|J a 



1 p° 



Therefore, our task boils down to controlling the supremum of a centered Gaussian process. For this
purpose, let $\tilde{w} = w/b$, which implies that



E 



max (w, — En 
Mla=l f-t b 3 



b,\\w\ 



E 



max(?i, \ (3jEj 



J6T 



where $\mathcal{T}$ denotes the spherical shell between the sphere centered at zero with radius
$r_{\max} = 1 + \mathfrak{s}\left(\sqrt{n} + \sqrt{\frac{\alpha}{c}\log(p) + \frac{1}{c}\log(s)}\right)$ and the sphere centered at zero with radius $r_{\min} = 2 - r_{\max}$. This can
of course be performed using Dudley's entropy bound recalled in Section B.6.1. In the terminology of
Section B.6.1, the semi-metric $d$ is given by

$d^2(w, w') = E\left[\left(\sum_{j \in T}(w - w')^t\,\beta_j E_j\right)^2\right]$.

The variables $\beta_j (w - w')^t E_j$, $j \in T$, are centered and have variance equal to $\mathfrak{s}^2\beta_j^2\|w - w'\|_2^2$. Thus,

$d(w, w') = \mathfrak{s}\,\|\beta_T\|_2\,\|w - w'\|_2$.

Let us now consider the entropy. An upper bound on the covering number of $\mathcal{T}$ with respect to the
Euclidean distance is given by

$H(\varepsilon, \mathcal{T}) \le n\log\left(\frac{3\,\mathfrak{s}\,r_{\max}\,\|\beta_T\|_2}{\varepsilon}\right)$.

Therefore, by Theorem B.7, we obtain that



E 



max (w, y PjEj 



3sr 



max||PT||2 

e 



de, 



with 



o- G = 5 r max \\Prh- 
Using the change of variable $\varepsilon' = \varepsilon/\sigma_G$, we obtain



(A. 31) E 
where we recall that 



max (w, PjEj) 



12 Cr 5 r B 



A.2.4. Conclusion of the proof. To sum up, combining (A.30) and (A.31), we obtain

(A.32) p(||B|| 2 > 12 Cjsy/nr^ ||#r||2 + /W IIArlb (v 7 ^ + |) I -Fa) < exp(-u), 
with 



7 < n 



C 



+ 12 — i- 5 \J~*T\ Vj] 



Mmax II II 2 Mmax 



Thus, 



jB[| 2 > ( 12 Cj s y/n r max + , /2ncr max + 12 Cjfi max Syfn r max + \i 

max „ I || pT 2 a 



->T 2 



(A.33) < exp (-it) . 

Taking $u = \alpha\log(p)$ and using Assumption 2.8 gives



'MI-BII2 > (l2 Cj 5 y/n r max + a\og(p) 

A'max I 



T 2 -^a 



Using the same trick as before, we have 



'^||£||2 > ^12 Cj s \fn r max + alog(p) /i n 



max \\PT 2 



< 



pa 



Finally, using (A.27), we have



|B|| 2 > ( 12 s 



r max + alog(p) ^ max ^ V^Vcll^T^TlbJ < 



A.3. Proof of Lemma 3.5. We will use the same arguments based on the Matrix Hoeffding inequality
as in Section A.1. For this purpose, define

1 



and write



w h 



\A* 



|£fe,. +^j*||2 



1 



We will need the following lemma. 
Lemma A.l. Let 



n,* eT * < 



Ej*\\2 <s \ n 



a (1-e- 1 ; 



0. c x ) ViosCp)"- 1 



1 



Then, P (£*) > 1 - 
Proof. See Section A.3.2.



□ 
A.3.1. Control of the deviation of $\|A^*\|_2$ by the Matrix Hoeffding Inequality. We can write $\|A^*\|_2$ as



£ 4- 



where A** is the matrix 



A* 



Wj* 







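Thus, by the triangle inequality, we have

(A.34) $\|A^*\|_2 \le \left\|\sum_{j^* \in T^*} A^*_{j^*} - E[A^*_{j^*} \mid \mathcal{E}^*_\alpha]\right\| + \left\|\sum_{j^* \in T^*} E[A^*_{j^*} \mid \mathcal{E}^*_\alpha]\right\|$

and we may apply the Matrix Hoeffding inequality again. We have that

and thus, on $\mathcal{E}^*_\alpha$,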



\\A' r -W.[A-, \SSi\\ < 2- 



1 — SA/n 



o (1 — e~ 1 ) ^ n 



1 \ « 



log(p)" 



Under Assumption 2.7, we have



T 

this former inequality becomes 



a (1 — e 1 ) \ n 



log(p) 



< 0.1, 



-E[A}* || < 351 



a (1 -e" 1 ) 



log(p) 1 



Applying the Matrix Hoeffding inequality, we obtain 



(A.35) 



| Y, A],-¥,[A* r \8l]\\ 2 >u\8* a 

( 

< 2(n + 1) • exp 



u 



\ 



9 s 2 n 



a (1-e- 1 ) 



log(p) 



T7=T 



Let us now turn to the expectation term, i.e. the last term in (A.34). We have



| E E i A r i ^ 



j*eT* 







'j* 



< 



E 3s i 

j*GT* 



n 



a (1 — e 

Cv 



1 



log(p)^ 1 



3* \\2 



This last inequality, when combined with (A.35) and (A.34), implies



IP ||A*|| 2 > 3s\ n 



a (I- e" 1 ) 
i?* C x 



log(p) 1 



- \\CK T *PT*h+u\S* 



< 2{n + 1) • exp 



u 



9 s 2 n 



« (1— e x ) \ n 

"XcT ' 



log(p) 



1 \ n \ \\R* ||2 

h>T* 1 1 2 
from which we deduce, by the same trick as in Section A.1, that



\\A*\\ 2 > Ss\ n 



a (1 — e 1 )\ n 



log(p) 1 



r \\£k t *Pt*\\2 + u 



< 2(n + 1) • exp 



a (1-e n / i \» \ up* ,, 2 



• 9 s 2 n 



log(p)"- 1 



j t* ii2 y 



+ 



Let us now choose u such that 

/ 

2(n + 1) • exp 

V 

i.e. 



u 



•9s 2 n 



a (1-e -1 ) \ n I i 



3* ||2 
T* 1 1 2 



8\/2 sy n 

Therefore, we obtain that 



a (1 — e \ " 



log(p) 1 



.|| 2 Valog(p) + log(2n + 2)). 



\\A*\\ 2 > Ss^jn 

+8V2 s\ln 



a (1 — e x ) \ n 



log(p) 1 



a (1 - e- 1 ) 



1 



log(p)^ 1 



T T « || 2V / alog(p) + log(2n + 2)) 



2 

< — . 

- p a 

By (3.14) and the definition of $\beta^*$, we have

\\PtA\2 < V^II^T^T'lb, 
= ||£/Ct/3t[|2 

and therefore, we obtain 



P Plk > 3sWn 



a (1 — e 

Cy 



-m ^ 



log(p) 



i/-i 



1 + 2^2 ^V«log(p) + l°g(2n + 2)) ||^ T ^r|| 2 



2 

< — . 

- p a 



A.3.2. Proof of Lemma A.1. Using the independence of the $E_j$, $j \in J_k$, we have



P(||^.|| 2 > u) 



min ||-Ej|| 2 > i* 



= n ^{\\E 3 \\i>u 2 )i 

< P(||^||2> U 2 ) min ^ eT * |J M 



We also have 



|£;lll<0- 



MIXTURE MODEL FOR DESIGNS IN HIGH DIMENSIONAL REGRESSION AND THE LASSO 






On the other hand, as is well known, we have 



< u' 



< C x [- 



2\ n 



for some positive constant $C_\chi$. Thus, the union bound gives



max \\Ej* H2 > u ) < s* ( 1 — C 



u 



2 \ ns min^*^. \J k 



n 5' 



Let us tune u so that 



i.e. 



s* 1 - a 



2 \ n 



n 5 Z 



< 



n 5 



Cl 



1 _ ( s *p- a ) mil V^* ^ 



and since min.,* eT * > $*log(p)^, 



n s 



u > — — ( 1 — exp 



O 



log(s*) 



On $(0, 1)$, we have

$$\exp(-z) \le 1 - (1 - e^{-1})\, z$$

and thus,

$$u^2 \ge n\, s^2 \left(\frac{\alpha\,(1-e^{-1})}{\vartheta_* C_\chi}\right)^{\frac{2}{n}} \left(\frac{1}{\log(p)^{\nu-1}}\right)^{\frac{2}{n}},$$

from which the desired estimate follows.
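
The elementary inequality $\exp(-z) \le 1 - (1-e^{-1})z$ on $(0,1)$ used in this proof is a chord bound for a convex function. A short verification, written in our own words and not taken from the original argument, reads:

```latex
% z -> e^{-z} is convex on [0,1], so its graph lies below the chord
% joining the endpoints (0, 1) and (1, e^{-1}):
\[
  e^{-z} \;\le\; (1-z)\cdot 1 + z\cdot e^{-1}
        \;=\; 1 - (1-e^{-1})\,z ,
  \qquad z \in [0,1],
\]
% with equality exactly at z = 0 and z = 1.
```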
A.4. Proof of Lemma \3M 

A. 4.1. Concentration of ||2?*||2. We start with the concentration of 



13",. 



. Cfc .„ + En* O 

j*eT* 11 j J 11 



Consider the matrix $E_{T^*}$, whose columns are independent. We would like to bound its operator norm.



Lemma A. 2. We have 



\E%»\\ > sK n , 8 . \£*) < ~ a 



where we recall that $K_{n,s^*}$ is defined by (2.8) above.
Proof. See Section A.4.4.
Define

$$\mathcal{F}^* = \mathcal{E}^* \cap \{\|E_{T^*}\| \le s\, K_{n,s^*}\}.$$

Thus, the union bound gives that ¥ (J 7 *) > 3/p a . Let us now turn to the task of bounding ||-E?*||2 
Notice that on J 7 *, we have 



□ 



E 



Xk-* + Ej*h~ 



-E» 



< max 
b 



E -jr E r 



where the maximum is over all 
b 6 1 — s\ n 



a (1 — e 



x 



1 



log^) 1 ' 1 



, 1 +s\ n 



a (1-e- 1 ; 



1 



log(p) l/ 1 
Thus, on $\mathcal{F}^*$,
ft.



< max (w, } -j-Eij*). 
6, w 2=1 b ,J 



Now, we have 



\B*\\n — E 



max 

6,||«'||2 = 



>u\T* 



<P max (w, V -£-Ej*)-E 

\ 6,||w||2=l .^~!L o 



Let 



j * €zT* 



Mi 



max 

&, 111/-' 1 1 3= 



> u i .f: . 



(«, E f 



We again have to bound $M^*_{b,w}$ on $\mathcal{F}^*$, and its conditional variance given $\mathcal{F}^*$. The Cauchy-Schwarz inequality gives



m 6 % < Jii^ik/E(^) 2 > 



and thus, 



Mt, w < -||^|| 2 ||^|| 2 



< h 



Jf* 1 1 2 1 1 &x 



Et* \\\\w\\o. 



Thus, on $\mathcal{F}^*$, using the fact that $\|w\|_2 = 1$, we have

M b,w - Mmaxll$T*l|2, 

where 

and $K_{n,s^*}$ is defined by (2.8). Let us now turn to the conditional variance of $M^*_{b,w}$ given $\mathcal{F}^*$.
Lemma A. 3. We have 



where $\sigma^*_{\max}$ is defined by (2.9).
Proof. See Section A.4.5.



□ 



Using Talagrand's inequality (Theorem B.6) again, we obtain that



(A.36) 



M? 

P I max — ,, „ > E 



&,|M|2=1 /^maxll/^T* 1 1 2 

< exp (—u) . 



M, 



max 



b,w 



b,\\w\\ 2 =l Mmaxll^T*ll2 



with 
b,w



- . As in Section [Al2j we will use 



A. 4. 2. Control of the conditional expectation o/maxi n«,||,=i — 

Ml IM Mmax \\Pt*h* 

Dudley's entropy integral bound to control the expectation, but this time, the sub-Gaussian version 
of Section IB. 91 Let us rewrite 



E 



max Ml | J"* 

0, \\W 2=1 



E 



max («}, V /3f%*) | J"* 



First, we have to prove the sub-Gaussianity of $M^*_{b,w}$. Notice that, due to rotational invariance of the
Gaussian measure, conditionally on $\mathcal{F}^*$, $E_{j^*}^t w$ is centered and



(«5 - w, Pj* E r) >u\T*\ < P | /3]*[w- w'Y Ej. >u\F c 
j*eT* J \j*£T* 



~i\t 



^{.O^D{C)E r )\w -w')>u\F* a 



j* ET * 



where $\zeta$ is a Rademacher $\pm 1$ random vector and $O_{w-w'}$ is the orthogonal transform which sends $w - w'$ to
the vector $\|w - w'\|_2/\sqrt{n}\; e$, where $e$ is the vector of all ones. Thus,



(w-w', Y P**E j *)>u\F* a 



\w — w'\\2 



v j* eT * i=1 



We now study the sub-Gaussianity of $\sum_{i=1}^n \zeta_i E_{i,j^*}$. Using the Laplace transform version of Hoeffding's
inequality, we have



E 



exp Tj- 



\W — W 2 



< exp | 77 £ /J sin 



1 



Therefore, using independence of the $E_{j^*}$'s, we have that



E 



exp 



rj \\w — w ||2 



E 



E 



—j= — y . 



n e 



77 1 1 10 — 11/ ||2 



E-* T* 
■^3 1 •'a 
Now Chernoff's bound gives



u i j: 



< e~^ u E 



exp 



ry \\w — to ||2 



7^ — - E ^ V E I J 7 * 



^ I 2 11 ~ ~/||2||o* ||2 „2 a (1 e 1 ) 

< exp i) 1 1 to — to H2II/5 lb 5 



0. C 



X 



1 1 

1 



log(p) 



u-1 



rju 



Optimizing in $\eta$ gives



v .» er , i=1 



< exp 



\ 



it 



,7,/ 112 



w - W 'ir 2 \\r T * ill * 2 ( 5 (1^) 5 j ; 
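
For the reader's convenience, the optimization in $\eta$ has the following generic form. Here $a > 0$ is a placeholder of ours for the variance proxy multiplying $\eta^2/2$ in the exponent of the Chernoff bound; it is not notation from the paper.

```latex
% Minimize the exponent eta^2 a / 2 - eta u over eta > 0:
% the derivative a*eta - u vanishes at eta = u / a.
\[
  \inf_{\eta > 0} \exp\!\Big( \frac{\eta^{2} a}{2} - \eta\, u \Big)
  \;=\; \exp\!\Big( -\frac{u^{2}}{2a} \Big),
  \qquad \text{attained at } \eta = \frac{u}{a}.
\]
```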

Using the union bound and invariance of the bound with respect to sign change, we thus obtain 



(w - w , ^2 /?*, Ej* 

/ , 

< 2 exp — 

V 



>u\ Tl 



w — w 



r,/ll2 



2 c 2 / ofl-e-l) \ " / 1 



12 T* 1 1 2 3 ^ C x 
Thus, the process is sub-Gaussian with the semi-metric d, given by 



log(p)"- 1 ) ) 



... r', Al\.r. .rJ\\2na* ,|2 J2 A* (1 - e X ) \ " / J 



<i (u), u) ) = 4||iZ> — w 



12 II/- 1 T*ll2 



Let us now apply Theorem B.9. The diameter of $\mathcal{T}^*$ is bounded from above by

1 



/ \log(p)" 

An upper bound on the covering number of $\mathcal{T}^*$ with respect to the semi-metric $d$ is given by

/. 

H(e,T*) < nlog 



3-2r* 



T*\\2S\j\ # mCx ) I iog(p)-l 



V 

Therefore, by Theorem B.7, we obtain that



/ 



E 



ti)6T ' — * 



< 12y/n 



log 



6 r* 



T*\\2 5 



a (l-e- 1 ) 



log(p) 



1 \ n ^ 
/ 



with 



0"G 
Using the change of variable



2:! 



£ 



e 



2 r* 



T*\\2 *\/ I 0, Cv 



log(p)" 



we obtain 



E 



max (w, V | T* 

weT 1 — ' 



(A.37) 
where 



< 24 r* 



T* \\2 S1 



a (1 — e x ) \ " 



log(p) 



v-l 



C*r, 



A.4.3. Last step of the proof. Combining (A.36) and (A.37), we obtain



|£*ll>24 r* max 



T* \\2 S\ 



a (1 — e 



-in n 



&*C X J Vlog(p)"-V 1 



C) 



(A.38) 
with 



7 = n 



i * 



+ E 



3*»|| 2 (^2 U7 * + -J | J 7 * I <exp(-u), 



M, 



max 



6. u; 



J 7 * 



6,||tu||2=l /Vax HPr* 112 



Therefore, taking $u = \alpha \log(p)$, we have



|5*||>24 < ax ||/?Vll2 5 1 



a (1 — e 



-in tt 



+ ^/2alog(p)(L*) + 



alog(p) 



1 



C* f 



logipy- 1 ) s 

// ilia x ! ll^lla I ^1 <P~ a , 



with 



n- 



* ||2 

T* \\2 



i?* C- 



Using Assumption 2.9, we obtain



|5*|| > (24 r^ axS1 



a (l-e- 1 ] 



1 



log(p) 



i/-i 



C* 



+/4 ax alog(p)) || J, ||, j nr". 
Using the same trick as before, we obtain 



> (24 r* max si 



a (1 — e 

Cy 



-D 



1 



log(p) 



V-l 



+/x* nax alog(p) ||/3£.|| 2 < 



P 
Notice further that

W* T *h < (1 + ^)11^^112 
= {1 + P £)\\£tPt\\2, 

by definition of $\beta^*$. Thus, we obtain that



|F*|I> (24 r* m3JL s] 



a (I- e- 1 ) 



log(p) 



CI 



+^ma x alogO) ) ||£t/3t||2 J < — ■ 



as desired. 

A.4.4. Proof of Lemma A.2. Let us first notice that since $\|E_{T^*}\|^2 = \|E_{T^*} E_{T^*}^t\|$, we can write

$$\|E_{T^*}\|^2 = \Big\| \sum_{j^* \in T^*} E_{j^*} E_{j^*}^t \Big\|.$$

This latter expression is well suited for our problem, since it is the norm of the sum of independent
positive semi-definite random matrices, for which the Matrix Chernoff inequality of Section B.3 applies.
In order to apply this inequality, we need a bound on the norm of each summand. By Lemma A.1,
on $\mathcal{E}^*$, we have



\Ej*E]* 



\ E i*\\-2 

i a (1 - e _1 )\ ™ 
< s 2 n 1 ] 



1 



i9* C x J \\ogipy- 1 
We also need a bound on the norm of the expectation. We have 



E 



E E r E h I n 



]T E [ E j* E j* I 



Due to rotational invariance, we have that the law of $E_{j^*}$ is the same as the law of $D(\zeta) E_{j^*}$, where
$\zeta_1, \ldots, \zeta_n$ are i.i.d. Rademacher $\pm 1$ random variables independent of $E_{j^*}$. Thus,

E [(i E i,j*(i' E i',j* \ £ a ] = E [E [C i E i j*C i >E i 'j* I Eij*, Ei>j* | £*]] 



(A.39) = 0. 
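
The cancellation behind (A.39) can be checked by brute force in a small dimension. The sketch below is ours and only illustrates the sign-flip mechanism: it averages the outer product of $D(\zeta)v$ with itself over all sign vectors $\zeta$, for an arbitrary fixed vector `v` (a hypothetical stand-in for one column $E_{j^*}$), and ignores the conditioning event.

```python
import itertools
import numpy as np

def sign_averaged_outer(v):
    """Average of (D(z) v)(D(z) v)^T over all sign vectors z in {-1, +1}^n."""
    n = len(v)
    total = np.zeros((n, n))
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        flipped = np.array(signs) * v          # D(z) v
        total += np.outer(flipped, flipped)
    return total / 2 ** n

# Hypothetical example vector, chosen only for illustration.
v = np.array([0.5, -1.25, 2.0])
avg = sign_averaged_outer(v)
# Off-diagonal entries cancel between z and its partial flips;
# the diagonal keeps the squares v_i^2.
print(avg)
```

Averaging over the signs leaves a diagonal matrix, which is the reason the conditional expectation of $E_{j^*}E_{j^*}^t$ has no off-diagonal part.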

On the other hand, we have the following result. 
Lemma A. 4. We have 

2 fa (1 -e" 1 )^" 



E[E\ 



1 



\v-l 



#*c x j \io g ( P y 

Proof. Due to rotational invariance of the law of $E_{j^*}$ and the event $\mathcal{E}^*$, we have

E K**K] =•••= E K;*K]- 

Therefore, 



1 



E 



n 



U'=l 






and by the definition of £*, 



a (1 — e : )\ n 



log(p) 



□ 



Based on this lemma, and the fact that the matrix

$$\mathbb{E}\big[ E_{j^*} E_{j^*}^t \mid \mathcal{E}^* \big]$$

is diagonal by (A.39), we obviously obtain that



E 



E E ^ E r I K 



a (1-e- 1 ; 



\og(p) v 1 



With the bound on the norm of the expectation and on the variance in hand, we are now ready to
apply the Matrix Chernoff inequality and obtain



E E r E )* 



>u\Tl 



< n 



1 



log(p)" 



1 



, log(p) 



T7=T 



U 



Let us finally tune $u$ so that the right hand side term is less than $p^{-\alpha}$, i.e.

i , ,i 



log(n) + log 



( es 2 ( a(l-e-^) \n, i 

- 1 #*C X \ log(p)"-i 



< -Q 



s 2 n / a (1-e ') \~ 



0» C v 



1 



Iog(p)"- 



U 



log(p). 



Take 
(A.40) 



> /a (1 — e 



log(p) 



log(p). 



Since, by assumption, $p \ge e^{e^{2-\log \alpha}}$, we have $-\log(\log(p)) + \log(e/\alpha) \le -1$. Moreover, the value of $u$
given by (A.40) is less than or equal to $s^2 K_{n,s^*}^2$ with $K_{n,s^*}$ given by (2.8). This completes the proof.

A.4.5. Proof of Lemma A.3. Independence of the $E_{j^*}$, $j^* \in T^*$, allows us to write

Var (M b % | T*) = £ ^Var [E]*w \ T*) , 

and, using the Cauchy-Schwarz inequality again, we obtain
Var(M 6 *J^) = J!%l!* ' 



/ Var2 [E^w I J* 

j*eT* 



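On the other hand, notice that, due to rotational invariance of the Gaussian measure, conditionally
on $\mathcal{F}^*$, $E_{j^*}^t w$ is centered and

$$\mathrm{Var}\big(E_{j^*}^t w \mid \mathcal{F}^*\big) = \mathbb{E}\Big[ \big( (O_w^t D(\zeta) E_{j^*})^t w \big)^2 \mid \mathcal{F}^* \Big] = \mathbb{E}\Big[ \big( E_{j^*}^t D(\zeta)\, O_w w \big)^2 \mid \mathcal{F}^* \Big],$$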
where $\zeta$ is a Rademacher $\pm 1$ random vector and $O_w$ is the orthogonal transform which sends $w$ to the
vector $\frac{1}{\sqrt{n}}\, e$, where $e$ is the vector of all ones. Thus,



Var (E}*w \ J 7 *) = —E [e \{E], D(Qe) 2 | E, T* a 



Moreover, 



E 



{E]*D{Q)e) 2 \E,K 



E 



E,T* 



si=l 



and expanding the square of the sum gives 



E 



(E t j .D(QeY\E,F a 



Ea* 



J* 112- 



2 

' max ' 



Using the bound on b, we finally obtain 

Var(M^|^) < < 
where $\sigma^{*2}_{\max}$ is given by (2.9).
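
The proof above relies on an orthogonal transform $O_w$ sending $w$ to a multiple of the all-ones vector. One concrete way to build such a map is a Householder reflection; the sketch below is an illustration under that choice (the paper does not specify a construction), and the function name `orthogonal_map_to_ones` is hypothetical.

```python
import numpy as np

def orthogonal_map_to_ones(w):
    """Return an orthogonal matrix O with O @ w = ||w||_2 * e / sqrt(n)."""
    w = np.asarray(w, dtype=float)
    n = w.size
    target = np.linalg.norm(w) * np.ones(n) / np.sqrt(n)
    diff = w - target
    if np.allclose(diff, 0.0):
        return np.eye(n)                      # w already points along e
    u = diff / np.linalg.norm(diff)
    # Householder reflection across the hyperplane orthogonal to u;
    # it swaps w and target because they have the same Euclidean norm.
    return np.eye(n) - 2.0 * np.outer(u, u)

# Hypothetical example vector, chosen only for illustration.
w = np.array([3.0, -1.0, 2.0, 0.5])
O = orthogonal_map_to_ones(w)
print(O @ w)
```

For a unit vector $w$ the image is exactly $e/\sqrt{n}$, which is the normalization used in the variance computation.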

Appendix B. Norms of random matrices, $\varepsilon$-nets and concentration inequalities
B.1. Norms and coverings.

Proposition B.1. ([22, Proposition 2.1]). For any positive integer $d$, there exists an $\varepsilon$-net of the unit
sphere of $\mathbb{R}^d$ of cardinality

$$2d \left(1 + \frac{2}{\varepsilon}\right)^{d-1}.$$

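The next proposition controls the approximation of the norm based on an $\varepsilon$-net.

Proposition B.2. ([22, Proposition 2.2]). Let $\mathcal{N}$ be an $\varepsilon$-net of the unit sphere of $\mathbb{R}^d$ and let $\mathcal{N}'$ be
an $\varepsilon'$-net of the unit sphere of $\mathbb{R}^d$. Then for any linear operator $A : \mathbb{R}^d \to \mathbb{R}^d$, we have

$$\|A\| \le \frac{1}{(1-\varepsilon)(1-\varepsilon')} \sup_{v \in \mathcal{N},\, w \in \mathcal{N}'} |v^t A w|.$$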
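
To see the net-based norm bound of Proposition B.2 at work, here is a small numerical sketch in dimension $d = 2$, where an $\varepsilon$-net of the unit circle is easy to write down. The matrix `A`, the net size, and the helper name `circle_net` are our own illustrative choices.

```python
import numpy as np

def circle_net(num_points):
    """Equally spaced points on the unit circle and the epsilon they achieve."""
    angles = 2.0 * np.pi * np.arange(num_points) / num_points
    net = np.column_stack([np.cos(angles), np.sin(angles)])
    # Every unit vector is within angle pi/num_points of a net point,
    # i.e. within chord distance 2*sin(pi/(2*num_points)).
    eps = 2.0 * np.sin(np.pi / (2.0 * num_points))
    return net, eps

# Hypothetical example operator, chosen only for illustration.
A = np.array([[2.0, -1.0], [0.5, 3.0]])
net, eps = circle_net(12)
net_sup = np.max(np.abs(net @ A @ net.T))   # sup over v, w in the net of |v^t A w|
op_norm = np.linalg.norm(A, 2)              # largest singular value of A
upper = net_sup / ((1.0 - eps) ** 2)        # bound of Proposition B.2 with eps = eps'
print(net_sup, op_norm, upper)
```

The supremum over the net never exceeds the operator norm, and dividing by $(1-\varepsilon)^2$ turns it into an upper bound, as the proposition states.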

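B.2. The Matrix Hoeffding Inequality. A non-commutative version of the famous Hoeffding
inequality was proposed in [25]. We recall this result for convenience.

Theorem B.3. Consider a finite sequence $(U_j)_{j \in T}$ of independent random self-adjoint matrices with
dimension $d$, and let $(V_j)_{j \in T}$ be a sequence of deterministic self-adjoint matrices. Assume that each
random matrix satisfies

$$\mathbb{E}[U_j] = 0 \quad \text{and} \quad U_j^2 \preceq V_j^2 \quad \text{a.s.}$$

for all $j \in T$. Then, for all $t \ge 0$,

$$\mathbb{P}\Big( \lambda_{\max}\Big( \sum_{j \in T} U_j \Big) \ge t \Big) \le d\, \exp\Big( -\frac{t^2}{8\sigma^2} \Big), \qquad \text{where } \sigma^2 = \Big\| \sum_{j \in T} V_j^2 \Big\|.$$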



B.3. The Matrix Chernoff inequality. The following non-commutative version of Chernoff's
inequality was recently established in [25].

Theorem B.4. (Matrix Chernoff Inequality [25]) Let $X_1, \ldots, X_p$ be independent random positive
semi-definite matrices taking values in $\mathbb{R}^{d \times d}$. Set $S_p = \sum_{j=1}^p X_j$. Assume that for all $j \in \{1, \ldots, p\}$,

$$\|X_j\| \le B \ \text{ a.s.} \quad \text{and} \quad \|\mathbb{E}\, S_p\| \le \mu_{\max}.$$

Then, for all $r \ge e\, \mu_{\max}$,

$$\mathbb{P}\big( \|S_p\| \ge r \big) \le d \left( \frac{e\, \mu_{\max}}{r} \right)^{r/B}.$$

(Set $r = (1+\delta)\mu_{\max}$ and use $e^{\delta} \le e^{1+\delta}$ in Theorem 1.1 of [25].)
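
The parenthetical remark can be expanded as follows. We assume, as we recall it, that Theorem 1.1 of [25] bounds the tail at level $(1+\delta)\mu_{\max}$ by $d\,[e^{\delta}/(1+\delta)^{1+\delta}]^{\mu_{\max}/B}$; the reader should check this form against the reference before relying on the computation.

```latex
% With r = (1 + delta) mu_max, i.e. 1 + delta = r / mu_max:
\[
  d \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\max}/B}
  \;\le\;
  d \left[ \frac{e^{1+\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\max}/B}
  \;=\;
  d \left( \frac{e}{1+\delta} \right)^{(1+\delta)\mu_{\max}/B}
  \;=\;
  d \left( \frac{e\,\mu_{\max}}{r} \right)^{r/B}.
\]
```

The condition $r \ge e\,\mu_{\max}$ simply ensures that the base $e\,\mu_{\max}/r$ is at most one, so that the bound decreases in $r$.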
B.4. Gaussian i.i.d. matrices. The following result on random matrices with Gaussian i.i.d. entries
can be found in [27, Corollary 5.35].

Theorem B.5. Let $G$ be an $n \times m$ matrix whose entries are independent standard normal random
variables. Then for every $u \ge 0$, with probability at least $1 - 2\exp(-u^2/2)$, one has

$$\sqrt{n} - \sqrt{m} - u \le \sigma_{\min}(G) \le \sigma_{\max}(G) \le \sqrt{n} + \sqrt{m} + u.$$
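
A quick simulation illustrates the two-sided bound of Theorem B.5. The dimensions, the deviation level `u` and the random seed below are arbitrary choices of ours; with $u = 5$ the failure probability allowed by the theorem is at most $2e^{-12.5}$.

```python
import numpy as np

# Illustrative parameters (not from the paper).
rng = np.random.default_rng(0)
n, m, u = 400, 50, 5.0

G = rng.standard_normal((n, m))                     # i.i.d. N(0, 1) entries
singular_values = np.linalg.svd(G, compute_uv=False)
s_min, s_max = singular_values.min(), singular_values.max()

lower = np.sqrt(n) - np.sqrt(m) - u                 # sqrt(n) - sqrt(m) - u
upper = np.sqrt(n) + np.sqrt(m) + u                 # sqrt(n) + sqrt(m) + u
print(lower, s_min, s_max, upper)
```

In practice the extreme singular values sit well inside the interval, since the theorem only needs the deviation term $u$ to absorb fluctuations around $\sqrt{n} \pm \sqrt{m}$.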

B.5. Talagrand's concentration inequality for empirical processes. The following theorem,
which is a version of Talagrand's concentration inequality for empirical processes, was proved in [¥,
Theorem 2.3].

Theorem B.6. Let $X_j$ be a sequence of i.i.d. variables taking values in a Polish space $\mathcal{X}$, and let $\mathcal{F}$
be a countable family of functions from $\mathcal{X}$ to $\mathbb{R}$ and assume that all functions $f$ in $\mathcal{F}$ are measurable,
square integrable and satisfy $\mathbb{E}[f] = 0$. If $\sup_{f \in \mathcal{F}} \operatorname{ess\,sup} f \le 1$, then we denote

$$Z = \sup_{f \in \mathcal{F}} \sum_{j=1}^{n} f(X_j).$$

Let $\sigma_{\max}$ be a positive number such that $\sigma_{\max}^2 \ge \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X_1))$ almost surely, then, for all $u \ge 0$,
we have

$$\mathbb{P}\Big( Z \ge \mathbb{E}[Z] + \sqrt{2\gamma u} + \frac{u}{3} \Big) \le \exp(-u),$$

with $\gamma = n \sigma_{\max}^2 + \mathbb{E}[Z]$.

B.6. Dudley's entropy integral bound. Let $(\mathcal{T}, d)$ denote a semi-metric space and denote by
$H(\delta, \mathcal{T})$ the $\delta$-entropy number of $(\mathcal{T}, d)$ for all positive real numbers $\delta$.

B.6.1. The Gaussian case. Let $(G_t)_{t \in \mathcal{T}}$ be a centered Gaussian process indexed by $\mathcal{T}$ and set $d$ to be
the covariance pseudo-metric defined by

$$d(t,t') = \mathbb{E}\big[ (G_t - G_{t'})^2 \big]^{1/2}.$$
Then, we have the following important theorem of Dudley, which can be found in the present form in 



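Theorem B.7. Assume that $(\mathcal{T}, d)$ is totally bounded. If $\sqrt{H(\delta, \mathcal{T})}$ is integrable at zero, then

$$\mathbb{E}\Big[ \sup_{t \in \mathcal{T}} G_t \Big] \le 12 \int_0^{\sigma_G} \sqrt{H(\delta, \mathcal{T})}\, d\delta,$$

where

$$\sigma_G^2 = \sup_{t \in \mathcal{T}} \mathbb{E}[G_t^2].$$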



B.6.2. The sub-Gaussian case. We start with the definition of sub-Gaussian processes.

Definition B.8. A centered process $(S_t)_{t \in \mathcal{T}}$ is said to be sub-Gaussian if for all $(t,t') \in \mathcal{T}^2$, and for
all $u > 0$,

$$\mathbb{P}\big( |S_t - S_{t'}| \ge u \big) \le 2 \exp\Big( -\frac{u^2}{2\, d^2(t,t')} \Big).$$

One easily checks that a Gaussian process is sub-Gaussian with the covariance semi-metric in the
former definition. Let $(S_t)_{t \in \mathcal{T}}$ be a centered sub-Gaussian process. We then have the following
standard result.



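Theorem B.9. Assume that $(\mathcal{T}, d)$ is totally bounded. If $\sqrt{H(\delta, \mathcal{T})}$ is integrable at zero, then

$$\mathbb{E}\Big[ \sup_{t \in \mathcal{T}} S_t \Big] \le C_{chain} \int_0^{\mathrm{diam}(\mathcal{T})} \sqrt{H(\delta, \mathcal{T})}\, d\delta$$

for some positive constant $C_{chain}$.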
Appendix C. Verifying the Candès-Plan conditions

The goal of this section is to prove Proposition 3.1, which gives a version of Candès and Plan's conditions
adapted to our Gaussian mixture model.

C.1. Important properties of $\mathfrak{C}$. The invertibility condition for (3.14) is a direct consequence of [24].
An alternative approach, based on the Matrix Chernoff inequality, is proposed in [13], with improved
constants. We have in particular

Theorem C.l. P21 Theorem 1] Let r £ (0,1), a > 1. Let A s sumptions 1 2. 3 and \2.J\ hold with 
(C.41) C spar > 



4(l + a)e 2 ' 

With fC C {1, . . . , K} chosen randomly from the uniform distribution among subsets with cardinality 
s* , the following bound holds: 

(C42) P(||£^-Id s ||>r) < ^. 

Moreover, the following property will also be very useful. 
Lemma C.2. (Adapted from |13[ Lemma 5.3]J If v 2 > e s* \\€-\\/K 0! we have 

s* \\<t\\ 2 \T^) 7 



P (max Hct-Cdl > — < K ( e — ^| ^ 
Based on this lemma, we easily get the following bound. 



Lemma C.3. Take C co \ > e (a + 1) max{ a/ C spar , }/ (1 — r). Then, we have 

pfmaxll^CJI > Cc ° \ 1 < — . 



Proof. Taking $v = C_{col}/\sqrt{\log(p)}$, we obtain from Lemma C.2



110" il C "* \ < K ( S * llgf M p)^ 10S(P) 
max CrCjt > . < K \ e ^ 



Using (2.4), this gives



max||4£j|> Cc °\ I < K (e^P-) C 



-,2 

«2i log(p) 



Since $C_{col} \ge e^2 (\alpha + 1) \max\{\sqrt{C_{spar}}, C_\mu\}$, we get



c. 



2 



. l g(p) / P \ [ °S(P) 

" 1 ; - m«+i 



and since, by Assumption 2.1, $K \le p$, we obtain that

pfmaxIKeJI > ^ ) < — • 



□ 



C.2. Similar properties for $X_{T^*}$.
C.2.1. Control of $\|X_{T^*}^t X_{T^*} - I\|$. We have

a min (X^ X T * ) = a min + St*) * D\ (£ Ct , + 

where $D_*$ (see Step 1 in the proof of Proposition 3.2) is a diagonal matrix whose diagonal elements
are indexed by $T^*$ and are defined by

1 



for $j^* \in T^*$. By the definition of $\mathcal{E}^*$, we have

0"min(-D*) > 



1 +5\ n 



. i 

q(l-e- 1 ) 



and 

0"max(A.) < 



1 — s\ n 



q(l— e" 1 ) \ n / 1 



1?, C x ; ,log(p)"-\ 

By the triangle inequality,



1 -r 

> 



(l+r) ||£ T .|I + I|£t.|| 2 



and 



1 M/ " 1 c x i \io g ( P y 



a max {X^Xt*) < W^D^W + \\C K DlE T * || + 

< (l + r) + (l + r) ||ffr.|| + ||ffr.|| 2 



1 - s » / "' 2 Sr^ i )"(i55^ 



Moreover, using Theorem C.1 and Lemma A.2, we obtain

218 



■(||A^Jf r .-j||>r*K) < jf 



with r* given by 



max ■ 



(l + r) + (l + r) sK nyS * +5 z Ki 



1 — 5\ n 



— i 

a(l— e _1 ) \ n / i 



(C.43) 



0. c x ; V lo g(p)" 

1 -r 



1 + s\/n 



a(l-e- 1 )' 



0, c x j v io g(p)" _1 

(l+ r) 5 K n ^+^Kl s , 
a{l-e- l )\ n / i 
Using Assumption 2.7, we have



and thus, by Assumption 2.7,



i . i 

Q (1— e _1 ) \ n / i 



1 + w^l°g(f) 



1 J 

a (1 — e _1 ) \ n f 1 



< 0.1 -r. 

On the other hand, 



< 



C s , n ,p f a(l-e 



l 



\/log(p) V V ^* Cx / v lo g(p) 



which, by Assumption 2.7, gives



«)»r a(1 .-r' ) W. A .1* 01 - r 



Summing up, we get 



??*c x y Vi°g(p)" -1 / " iog(p) 



(l + r) + (l + r) 0.1 • r + 0.01 • r 2 | 

log(p) 

r 2 ) 



< (l.l-r + 0.11- 

+2 (1 + 1.1 -r + 0.11 -r 2 ) — ^ 

log(p) 

and by Assumption 2.10,

$$r^* \le 1.1\, r\, (1.1 + 0.11\, r).$$

Thus, using Lemma A.1,

218 + 1 



|X^,X r * - I > 1.1 • r(l.l + 0.11 • r)) < 



V 



C.2.2. Control of $\max_{k \in T^{*c}} \|X_{T^*}^t X_k\|_2$. By the triangle inequality, we have that
max || X f T * X k || 2 = max 1 1 {£ K + E T * )* D 2 (£ fe + £ fe ) 1 1 2 

" f^ffcll^ll + ll^ll^ H^ll2 

+ ||£ T *|| max p fc || 2 Y|Z>, || 2 . 
A computation analogous to the one for the probability of £ a gives that 



fc maX |rt | M2 >»U + ^log W K < £ 
Thus, using Lemma C.3 and Lemma A.2, we obtain



Pi m^J|X^|| 2 >-^ + (l + r + ^ niS 05(^ + J— log(p)J \£*) < ?±2. 
Since, by Assumption 2.7,



we obtain 



(l + r + sir n>s .)5( v^+\/ — log(p)) < (1 + l.lT)-^, 

c y viog(p) 

Moreover, using Lemma A.1, we obtain

P Lax ||Xj,X t |l > <^ + (l+Hl'-)Ci W ) < £±3. 

C.3. The last two inequalities. The proof of (3.16) is standard and, under Assumption 2.7,
(3.17) can be proved using the ideas of [9, Section 3.3]. We give the proofs for the sake of
completeness.

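C.3.1. Control of $\|X_{T^{*c}}^t X_{T^*} (X_{T^*}^t X_{T^*})^{-1} X_{T^*}^t z\|_\infty$. For any $j \in T^{*c}$, we have

$$\mathbb{P}\Big( X_j^t X_{T^*} (X_{T^*}^t X_{T^*})^{-1} X_{T^*}^t z \ge u \Big) \le \frac{1}{2} \exp\left( -\frac{u^2}{2\sigma^2 \big\| X_{T^*} (X_{T^*}^t X_{T^*})^{-1} X_{T^*}^t X_j \big\|_2^2} \right).$$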
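
The coordinate-wise bound above is an instance of the Gaussian tail estimate $\mathbb{P}(N(0,\tau^2) \ge u) \le \frac{1}{2}\exp(-u^2/(2\tau^2))$, where $\tau$ stands for $\sigma$ times the norm appearing in the denominator. The sketch below, with grid values chosen by us purely for illustration, compares the exact tail with this bound.

```python
import math

def gaussian_upper_tail(u, sd):
    """Exact P(N(0, sd^2) >= u), via the complementary error function."""
    return 0.5 * math.erfc(u / (sd * math.sqrt(2.0)))

def half_exp_bound(u, sd):
    """The bound (1/2) * exp(-u^2 / (2 sd^2)) used for each coordinate."""
    return 0.5 * math.exp(-u * u / (2.0 * sd * sd))

# Illustrative grid of standard deviations and thresholds (our choice).
checks = [(u, sd, gaussian_upper_tail(u, sd), half_exp_bound(u, sd))
          for sd in (0.5, 1.0, 2.0) for u in (0.0, 0.5, 1.0, 3.0)]
for u, sd, tail, bound in checks:
    print(f"sd={sd}, u={u}: tail={tail:.3e}, bound={bound:.3e}")
```

The two quantities coincide at $u = 0$ and the bound dominates for every $u > 0$, which is what the union bound over $j \in T^{*c}$ then exploits.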



Taking u such that 



1 / v 2 \ C + 219 + 3 

- 2 6XP 1 ~o^2 1+r* (C eo , + (l+l.l-r) C 5 ,n, P ) 2 j + ^ 
\ L(J (l-r*y log(p) / 



1 / U 2 \ 1 

exp 



i.e. 



9rr 2 1+r* (C coi +(l+l.l-r) C s ,n.p)' J p° 
ZfJ (1-r*)* log(p) ' 



/ i / a , .0^02 1 + r * (C C oi + (1 + 1-1 • r) C7 SjniP ) 2 
(alog(p) — log(2)) 2cH 



(1 _ r *)2 log(p) 
Using the union bound, we finally obtain 



IX^Xt^Xt^X^zII > t/(alog(p) - log(2)) 2<r* 1 + r * (C ">< + (1 + L1 " r) ^ 



.2 



(1 _ r *)2 log(p) 



„ C + 223 
- pa-l 



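C.3.2. Control of $\|X_{T^{*c}}^t X_{T^*} (X_{T^*}^t X_{T^*})^{-1} \mathrm{sgn}(\beta_{T^*})\|_\infty$. Hoeffding's inequality gives

$$\mathbb{P}\Big( X_j^t X_{T^*} (X_{T^*}^t X_{T^*})^{-1} \mathrm{sgn}(\beta_{T^*}) \ge u \Big) \le \frac{1}{2} \exp\left( -\frac{u^2}{2 \big\| (X_{T^*}^t X_{T^*})^{-1} X_{T^*}^t X_j \big\|_2^2} \right).$$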
Choosing

(Ccoi + (1 + 1.1 • r) C„, n . 



u = W(alog(p)-log(2))2 



log(p) (1 — r*) 2 
and applying the union bound, we obtain 

P \xtX T *(Xt T ,X T *)-h g n(^) > ^ (alog(p) - log(2)) 2 ±| + j^ffi ^ 

C.3.3. Summing up. Using Assumption 2.6, we obtain that

\\X t T ,cXT*{X t T ,X T *y l X t T ,z\\ +A WX^cXt^X^Xt*)' 1 ^^] 



C + 223 

poc-l 



< a^l + 1.1 • r (1.1 + 0.11 • r) + - A 



as announced. 